Abstract


Periodic application of time-redundant error checking provides a trade-off between error
detection latency and performance degradation. The goal is to achieve high error coverage while
satisfying performance requirements. In this paper, we derive the optimal scheduling of checking
patterns in order to uniformly distribute the available checking capability and maximize the
error coverage. Synchronous buffering designs using data forwarding and dynamic
reconfiguration are described. Efficient single-cycle diagnosis is implemented by error pattern
analysis and a direct-mapped recovery cache. A rollback recovery scheme using start-up control
for local recovery is also presented.
1. INTRODUCTION

A variety of processor arrays have been proposed for signal and image processing and 
scientific computation applications [1]. In order to detect errors produced by faults in these ar- 
rays a variety of off-line testing procedures have been developed for detecting permanent faults 
and concurrent error detection (CED) techniques have been developed for transient and intermit- 
tent failures [2]. The focus of this paper is on processor arrays for systolic algorithms and the 
use of time-redundancy techniques for concurrent detection of errors [3,17].

Traditionally, CED is applied continuously to each computation activity so that an error 
resulting from a fault in the processing element (PE) can be detected immediately. However, 
when time redundancy techniques are used for error detection, this continuous checking scheme
may greatly degrade the array performance, e.g., by a factor of two for RESO [4] or alternating 
logic [5]. For some applications where high processing speed is crucial and error detection la- 
tency is tolerable, it may be possible to maintain the desired throughput while keeping a reason- 
ably high error coverage by turning the CED mechanism on and off periodically. Periodic Appli- 
cation of CED (PACED) offers such a trade-off in error latency and probability of error detec- 
tion versus performance degradation. 

Several techniques regarding the utilization of idle processor cycles for CED have been 
proposed [6-12]. For general-purpose machines with processor-level parallelism, a technique 
called saturation has been introduced for utilizing the idle processors to execute replicated ver- 
sions of tasks and employing majority voting to determine the output [6]. For processors with 
multiple pipelined functional units, like the Cray-1, RESO has been applied to the idle function 
units and was shown to equip the scalar unit with error checking capability at the cost of minor 
performance degradation [7]. Another recently proposed technique, called Available-Resource 
Control-flow monitoring (ARC) [8], is aimed at the resource parallelism of instruction-level 
parallel processors such as superscalar and Very Long Instruction Word (VLIW) processors. The
idle resources in these processors were utilized to detect the control-flow errors. 

In the area of systolic architectures, one approach has been developed to take advantage of 
the existing bypassing links in a reconfigurable array to pass the same input data to two adjacent 
PEs and then compare the outputs to do the error detection [9]. A control bit, called test token, 
was inserted periodically from outside and passed along the array to determine when a particular 
PE should invoke a duplicated operation on its neighbor. Related results using error checking 
code to achieve algorithm-based fault tolerance for a systolic sorter have also been developed 
[10].

The incorporation of CED capability in systolic arrays for band matrix multiplication has 
been developed with design parameters such as throughput latency, per-cycle PE utilization rate 
and I/O bandwidth [11]. The arrays were required to have a per-cycle PE utilization rate less 
than 50% in order to leave room for the RESO-based CED technique. Flexible designs were pro- 
posed [12] which allow the user to either employ the full throughput rate capability of the system 
or trade off the throughput rate for greater reliability. 

In the initial description of the general concept of periodic application of CED (PACED) 
[3] by Chen et al., error pattern analysis was performed only for a specific set of PACED param- 
eters and the actual implementation was not discussed. The major contribution of this current pa- 
per is that we start from a general formulation by defining a set of PACED parameters which are 
optimized to achieve the maximum error coverage and reduce the hardware cost. 

The PACED implementation considered utilizes the following properties: 

(1) Each PE is capable of performing time-redundant computation checking for itself as well as 
input code checking for the possibly erroneous output data produced and propagated by 
previous PEs, 
(2) A single fault is present between the time of the initial fault occurrence and error detection.

The processor arrays considered in this study are unidirectional linear processor arrays
consisting of Q processing elements with inputs entering from the top and left [1,13,14]. A PACED
array driven by the original clock and equipped with the capability of concurrent error detection
and error recovery is shown in Fig. 1. The control logic consists of circuitry to perform
buffering, diagnosis, rollback and start-up control.

The outline of the paper is as follows: Section 2 establishes the system parameters; Section 
3 gives the optimization of system parameters with respect to various metrics; Section 4
proposes the required design changes for data buffering, error diagnosis and recovery; Section 5
concludes the paper.


2. SYSTEM PARAMETERS 

For two PEs in our processor array, PEi is upstream of PEj and PEj is downstream from
PEi if i < j. PEs may not be identical; however, each has approximately the same processing time,
so that, without PACED, the array forms a balanced pipeline with clock cycle time equal to a
time units. When CED is applied, each PE needs another b time units to perform error checking.
For the purpose of preserving the synchronous nature of the original processor array, b is
rounded off to a multiple of a, i.e., b = ka. Therefore, for example, k = 1 corresponds to 100% time
overhead. The entire activity applied to a certain set of data at each PE is called a computation
cycle, with or without checking, as opposed to the physical clock cycle, which always takes a time
units. At the beginning of a clock cycle, each PE reads from its local counter or a global counter
the checking bit to determine whether it should perform the checking (1) or not (0). Checking
patterns are the plots of checking bits as a function of computation cycle number, as shown in
Fig. 2.

Figure 1. Block diagram of the PACED array with buffering and recovery caches.



Figure 2. Checking pattern as a function of computation cycle number.
The basic idea of PACED is to schedule the checking patterns with the same checking
frequency among PEs. Therefore, while some PEs are executing computation cycles with checking,
some are not. The Corresponding Computation Cycles (CCCs) are defined to be all those cycles 
on different PEs which were originally executed at the same time in the array without CED. A 
task is defined to consist of all the activities applied to each input data by the processor array to 
obtain the corresponding output. The task path consists of all those cycles on different PEs at 
which a certain task is processed as it travels across the array. Each set of checking patterns is 
characterized by the following four parameters all in terms of computation cycles : 

(1) M : length of one period; 

(2) N : length of one checking burst; 

(3) Om : offset between checking patterns for adjacent PEs;

(4) Oi : initial offset (with respect to computation cycle 0) of the checking pattern for the first
PE. The numbering of the computation cycles is shown in Fig. 2.

While the checking pattern plot in terms of computation cycle as in Fig. 2 is used to illustrate
the idea of PACED, it is more convenient to use the Task/PE diagram shown in Fig. 3 for
our analysis. In such a diagram, the checking patterns are adjusted so that each column 
corresponds to a single task path. The offset between adjacent patterns, J, becomes Om plus one. 
The Task/PE diagram will be used to analyze problems related to computation cycle such as Er- 
ror Detection Latency (EDL) analysis and diagnosis. 

It can be shown that the choice of Oi does not affect the analysis. Moreover, we will
consider N and N/M as two of the parameters instead of M and N. Therefore, the three parameters
involved in the optimization problem will be N, N/M and Om.



[Figure 3 labels: J = Om + 1; CCCs; task path; task number.]

Figure 3. Task/PE diagram of the same array as in Fig. 2 with shaded squares
representing computation cycles with checking.

3. OPTIMIZATION OF SYSTEM PARAMETERS 


3.1. Performance Analysis 

It is intuitive that N/M is a measure of the checking frequency and, therefore, a measure of
time overhead. However, because of the imbalance introduced by PACED, the wait time
between the output of the upstream PE and the input of the downstream PE of each adjacent pair
constitutes another time overhead in addition to the checking time. The total execution time of a
certain task is then given as

Total execution time = computation time + checking time + wait time

Fig. 4(a) shows the task waiting pattern with each arrow starting from the time the
upstream PE outputs the data and pointing to the time the downstream PE reads in the data. The
wait time after computation cycle j, Wait(j), in terms of number of clock cycles has the following
dependence on Om as well as on N and M and is shown in Fig. 4(b).
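$$
\mathrm{Wait}(j) =
\begin{cases}
O_m \times k, & 0 \le j \le N - O_m - 1 \\
(N - j - 1) \times k, & N - O_m \le j \le N - 1 \\
0, & N \le j \le M - O_m - 1 \\
(O_m - M + j + 1) \times k, & M - O_m \le j \le M - 1
\end{cases}
$$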
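The wait-time profile can be tabulated with a few lines of Python. This is only a sketch: the function name `wait_time` is hypothetical, and the range boundaries are an assumption read off the tick marks of Fig. 4(b) (constant Om×k for the first N-Om cycles, a linear fall to 0 at cycle N-1, zero until cycle M-Om-1, then a linear rise back to Om×k at cycle M-1). It further assumes 0 <= Om <= N and N + Om <= M so that the four ranges do not overlap.

```python
def wait_time(j, M, N, Om, k):
    """Wait time, in clock cycles, after computation cycle j (taken modulo M).

    Assumed piecewise form (ranges inferred from Fig. 4(b)):
      Om*k              for 0      <= j <= N-Om-1
      (N-j-1)*k         for N-Om   <= j <= N-1
      0                 for N      <= j <= M-Om-1
      (Om-M+j+1)*k      for M-Om   <= j <= M-1
    """
    if not (0 <= Om <= N and N + Om <= M):
        raise ValueError("sketch assumes 0 <= Om <= N and N + Om <= M")
    j %= M
    if j < N - Om:
        return Om * k
    if j < N:
        return (N - j - 1) * k
    if j < M - Om:
        return 0
    return (Om - M + j + 1) * k


if __name__ == "__main__":
    # Illustrative parameters only: M = 10, N = 4, Om = 3, k = 1.
    print([wait_time(j, 10, 4, 3, 1) for j in range(10)])
```

Under these assumptions the largest value over one period is Om×k, which is the quantity that later determines the number of buffers.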


Therefore, fixing N/M does not necessarily fix the time overhead. However, the most important
performance measure for a processor array is the throughput rather than the total execution
time of each single task. As shown in Fig. 4(a), although the data has to wait for the downstream
PE to become free, the PEs are always kept busy. Therefore, the processor array will produce M
outputs for every (M + N×k) clock cycles and the throughput can be calculated as

$$
\text{throughput} = \frac{M}{(M + N \times k)\,a} = \frac{1}{\left(1 + \frac{N}{M} \times k\right) a},
$$

which is independent of Om.

[Figure 4 labels: (a) clock cycles, with groups of Om tasks, (N-Om) tasks and Om tasks; (b) wait time versus
computation cycle number, with ticks at M-Om, M-1, 0, 1, 2, N-Om, N-1, N, N+1.]

Figure 4. (a) Task waiting pattern between adjacent PEs; (b) wait time as a function of
computation cycle number.


For N = 0, the PACED array reduces to the original array without CED, which has the
highest throughput, $1/a$, but no error detection capability. For N = M, PACED reduces to
continuous checking, which has the lowest throughput, $1/((1+k)a)$, but can detect errors without latency.

The problem of optimal scheduling for PACED is then formulated as: given a throughput
requirement $1/\left(\left(1 + \frac{N}{M} \times k\right) a\right)$, how to choose Om to minimize the error
detection latency and maximize the error coverage.
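As a quick numerical illustration of the throughput expression M/((M + N×k)a), the following sketch evaluates it exactly with rational arithmetic. The function name `throughput` and the sample parameter values are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def throughput(M, N, k, a):
    """Outputs per time unit: M outputs every (M + N*k) clock cycles of length a."""
    if not (0 <= N <= M):
        raise ValueError("checking burst length must satisfy 0 <= N <= M")
    return Fraction(M, M + N * k) / a


if __name__ == "__main__":
    a, k = 1, 1                      # unit clock period, 100% checking overhead
    print(throughput(10, 0, k, a))   # no checking
    print(throughput(10, 10, k, a))  # continuous checking
    print(throughput(7, 2, k, a))    # periodic checking with N/M = 2/7
```

The two limiting cases reproduce 1/a and 1/((1+k)a); any intermediate N/M falls between them, which is the trade-off the scheduling problem exploits.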


3.2. Potentially Infinite Error Detection Latency 

There are two cases in which transient faults occurring at certain cycles will have infinite 
error detection latency (EDL), which means that the errors will escape with 100% probability. 
Case 1: Improper choice of the values for M, N and Om.

According to the task/PE diagram in Fig. 3, as a task travels through the array, it can be
viewed as advancing in the checking pattern with a speed of J computation cycles per PE, where
J = Om + 1. If we can make sure that at least one cycle with checking, i.e., $0 \le j \le N-1$, is on the
path of each task, we can prevent the infinite EDL from occurring under the assumption that
errors cannot be masked during the propagation. (Error masking is considered in Section 3.5.) The
following Lemma 1, which is used in proving Fermat's Little Theorem [15], is used to prove Theorem 1.
LEMMA 1. The M numbers $0 \bmod M$, $J \bmod M$, $2J \bmod M$, ..., $(M-1)J \bmod M$
consist of precisely d copies of the M/d numbers $0, d, 2d, \ldots, (M/d - 1)d$, where $d = \gcd(M, J)$
(gcd stands for greatest common divisor).

LEMMA 2. With the same notation as in Lemma 1, define the remainder set
$R_v = \{ v + nd \mid 0 \le n \le M/d - 1 \}$ for $0 \le v \le d-1$. Then

(1) $(j + i \times J) \bmod M \in R_{j \bmod d}$ for $0 \le j \le M-1$ and all non-negative integers i.

(2) $\{ (j + i \times J) \bmod M \mid 0 \le i \le M-1 \} = R_{j \bmod d}$, where $d = \gcd(M, J)$. More precisely,
the values $(j + i \times J) \bmod M$, $0 \le i \le M-1$, contain d copies of each of the elements in $R_{j \bmod d}$.

Proof. See Appendix. 
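Although the proof is deferred to the Appendix, Lemma 2 is easy to check by brute force for small parameters. The sketch below is such a check, not a proof; the helper names `remainder_set`, `visited_cycles` and `lemma2_holds` are hypothetical. Lemma 1 corresponds to the special case j = 0.

```python
from collections import Counter
from math import gcd


def remainder_set(v, M, d):
    """R_v = {v + n*d | 0 <= n <= M/d - 1}."""
    return {v + n * d for n in range(M // d)}


def visited_cycles(j, M, J):
    """Multiset of (j + i*J) mod M for 0 <= i <= M-1."""
    return Counter((j + i * J) % M for i in range(M))


def lemma2_holds(M, J):
    """Check that each start cycle j visits R_{j mod d} with exactly d copies each."""
    d = gcd(M, J)
    for j in range(M):
        counts = visited_cycles(j, M, J)
        if set(counts) != remainder_set(j % d, M, d):
            return False
        if any(c != d for c in counts.values()):
            return False
    return True


if __name__ == "__main__":
    print(all(lemma2_holds(M, J) for M in range(1, 13) for J in range(1, 2 * M + 1)))
```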

THEOREM 1. Except for some faults occurring in the last (M-1) PEs of the array, all
transient faults have finite EDL (less than M) if and only if $\gcd(M, J) = d \le N$.

Proof. By the task/PE diagram in Fig. 3, a task entering PE0 at cycle j will be processed by
PEi at cycle $(j + i \times J) \bmod M$. By Lemma 2(1), $(j + i \times J) \bmod M \in R_{j \bmod d}$. If N < d, for any
j such that $N \le j < d$, every integer in $R_{j \bmod d}$ is greater than N-1; hence, for all non-negative
integers i, $(j + i \times J) \bmod M > N-1$, which means that a task entering PE0 at such a cycle j will never be
checked, resulting in infinite EDL if it is affected by some transient fault.

Conversely, if $d \le N$, then $(j \bmod d) \le N-1$. By Lemma 2(2), we have
$\{ (j + i \times J) \bmod M \mid 0 \le i \le M-1 \} = R_{j \bmod d}$, which means that an erroneous task produced at any
cycle j will be checked at least once at cycle (j mod d) by the faulty PE itself or one of its (M-1)
immediate downstream PEs. Therefore, as long as the faulty PE is not one of the last (M-1) PEs of
the array, the error will be detected. □
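The condition of Theorem 1 can be cross-checked by exhaustive search over small parameter values. The following sketch assumes an array long enough that end effects (Case 2) can be ignored; the function names are hypothetical.

```python
from math import gcd


def first_checked_step(j, M, N, J):
    """Smallest i >= 0 with (j + i*J) mod M in the checking burst 0..N-1, else None.

    Searching i < M suffices because the visited cycles repeat with period at most M.
    """
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return None


def all_transients_detected(M, N, J):
    """True if a task starting at any cycle j reaches a checking cycle."""
    return all(first_checked_step(j, M, N, J) is not None for j in range(M))


def predicted_by_theorem1(M, N, J):
    """Theorem 1: finite EDL for every cycle iff gcd(M, J) <= N."""
    return gcd(M, J) <= N


if __name__ == "__main__":
    ok = all(
        all_transients_detected(M, N, J) == predicted_by_theorem1(M, N, J)
        for M in range(2, 13)
        for N in range(1, M + 1)
        for J in range(1, M + 1)
    )
    print(ok)
```

For example, with M = 6, N = 1 and J = 2 (so d = 2 > N), a task starting at cycle 1 only ever visits cycles 1, 3 and 5 and is never checked.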

Case 2: Faults occurring near the end of the array.

As long as the CED is not applied continuously, EDL must exist. When a transient fault
occurs at some PE in a computation cycle with EDL larger than the number of the downstream PEs
from it, the error will escape. This phenomenon exists no matter how we schedule the checking
patterns (only the severity varies). One possible solution is to add a code checker with lower
complexity and higher reliability at the end of the array to perform continuous code checking in
order to intercept the escaping errors if desired.

3.3. Uniform Distribution of Checking Capability 

When the checking frequency is set to N/M, it is just an average over all tasks and does not
necessarily guarantee that each task will be processed with checking N/M of the time along its
path. Since all tasks have equal significance, it is desirable to schedule the checking patterns
such that each task is treated as uniformly as possible. Theorem 2 gives the condition for this
purpose.

THEOREM 2. Except for the difference due to the fact that array length Q might not
be a multiple of M, the checking capability is uniformly distributed among all tasks if
$\gcd(M, J) = d$ divides N.

Proof. If N = md, where m is a positive integer, then for every $R_v$, $0 \le v \le d-1$, the first m
elements $v + nd$, $0 \le n \le m-1$, are less than N and the others are not. By Lemma 2, the values
$(j + i \times J) \bmod M$, $0 \le i \le M-1$, contain d copies of each of the elements in $R_{j \bmod d}$, which implies that a
task entering PE0 at any cycle j will be checked $m \times d$ times among M computation cycles. Therefore, all
tasks are fairly treated and checked with the same frequency $\frac{md}{M} = \frac{N}{M}$. □
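A small brute-force count illustrates Theorem 2: when d divides N, every task is checked exactly N times per M consecutive PEs. The sketch below uses hypothetical function names and ignores the array-length effect mentioned in the theorem.

```python
from math import gcd


def checks_per_period(j, M, N, J):
    """Number of checked cycles among the M cycles visited by a task starting at cycle j."""
    return sum(1 for i in range(M) if (j + i * J) % M < N)


def is_uniform(M, N, J):
    """True if every start cycle j yields the same number of checks."""
    return len({checks_per_period(j, M, N, J) for j in range(M)}) == 1


if __name__ == "__main__":
    # d = gcd(6, 2) = 2 divides N = 2: every task is checked N = 2 times.
    print([checks_per_period(j, 6, 2, 2) for j in range(6)])
    # d = 2 does not divide N = 3: tasks are treated unequally.
    print([checks_per_period(j, 6, 3, 2) for j in range(6)])
```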
3.4. Minimization of Maximum Error Detection Latency

The EDL(j) of both permanent and transient faults occurring at computation cycle j with
checking ($0 \le j \le N-1$) are defined to be zero in terms of number of computation cycles because
they are detected immediately. The EDL(j) of other non-checking cycles under the constraint
that $1 \le J \le N$ are given by Lemma 3, with superscript "C" indicating that the error is detected as a
computation error and "I" indicating an input error. We can prove that the optimal solution under this
constraint actually achieves the optimality for general values of J. For the rest of this paper, we
will assume the duration of a transient fault is much less than the clock cycle time so that it can
only affect the result of one computation.^1

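LEMMA 3. For $1 \le J \le N$ and $N \le j \le M-1$,

(1) under transient faults:

$$EDL^{I}(j) = \left\lceil \frac{M-j}{J} \right\rceil; \quad EDL^{C}(j) = \infty,$$

(2) under permanent faults:

$$EDL^{I}(j) = \left\lceil \frac{M-j}{J} \right\rceil; \quad EDL^{C}(j) = M - j.$$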


Proof. (1) Because we can view a task traveling across the processor array as advancing on
the checking pattern with J cycles per step, we have the following formula for the EDL of a
transient fault for general J:

$$
EDL^{I}(j) = \left\lceil \frac{rM - j}{J} \right\rceil, \quad \text{where } r = \min \left\{ \left\lfloor \frac{j + i \times J}{M} \right\rfloor \;\middle|\; 0 \le (j + i \times J) \bmod M \le N-1,\ i \ge 1 \right\}.
$$

To our knowledge, there is no closed-form solution for this general problem. The constraint
$1 \le J \le N$ will make sure that an error caused by a fault occurring at a non-checking cycle is
captured by the next checking burst (r = 1) of some downstream PE. Therefore,

$$EDL^{I}(j) = \left\lceil \frac{M-j}{J} \right\rceil.$$

Under the assumption about the length of the transient faults, a PE cannot detect such faults
occurring at non-checking cycles, so $EDL^{C}(j)$ is infinity.

(2) A permanent fault can be considered as consisting of a large number of transient faults
occurring at consecutive computation cycles. Using the above formula, we have, for a permanent
fault,

$$EDL^{I}(j) = \left\lceil \frac{M-j}{J} \right\rceil.$$

For a permanent fault starting at a non-checking cycle, the faulty PE will detect it as a
computation error as soon as it enters the next checking burst. Therefore, $EDL^{C}(j) = M - j$. □

^1 Most of the results will still be valid without this assumption, except for the more complicated
error pattern analysis described later.

LEMMA 4. For $1 \le J \le N$ and $N \le j \le M-1$, if we define EDL(j) to be the latency until
the first error indication (computation or input error), then for both permanent and transient
faults: $EDL(j) = \left\lceil \frac{M-j}{J} \right\rceil$.

Proof. This follows immediately from Lemma 3, because

$$\left\lceil \frac{M-j}{J} \right\rceil < \infty \quad \text{and} \quad \left\lceil \frac{M-j}{J} \right\rceil \le M - j,$$

where equality holds for J = 1 or M - j = 1. Hence,

$$EDL(j) = \min \left\{ EDL^{I}(j), EDL^{C}(j) \right\} = EDL^{I}(j) = \left\lceil \frac{M-j}{J} \right\rceil. \quad \Box$$


Next we give some definitions for proving the optimal scheduling. 

DEFINITION 1. Given a sequence of n numbers $S = (s_1, s_2, \ldots, s_n)$, $\Pi(S)$ is defined to
be a nondecreasing permutation of S, i.e., $\Pi(S) = (\pi_1(S), \pi_2(S), \ldots, \pi_n(S))$ is a permutation
of S where $\pi_1(S) \le \pi_2(S) \le \cdots \le \pi_n(S)$.

DEFINITION 2. Given two sequences of n numbers S and T, we define $S \le T$ if $s_i \le t_i$ for
$1 \le i \le n$.

LEMMA 5. Define $E_J = (e_{1J}, e_{2J}, \ldots, e_{(M-N)J}) = (EDL_J(M-1), EDL_J(M-2), \ldots, EDL_J(N))$
for $0 \le O_m \le M-1$, then

(1) $\Pi(E_J) = E_J$ for $1 \le J \le N$

(2) $E_J \ge E_{J+1}$ for $1 \le J \le N-1$

(3) $\Pi(E_N) \le \Pi(E_J)$ for all integer J.


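Proof. (1) By Lemma 4,

$$e_{iJ} = EDL_J(M-i) = \left\lceil \frac{M - (M-i)}{J} \right\rceil = \left\lceil \frac{i}{J} \right\rceil \quad \text{for } 1 \le J \le N \text{ and } 1 \le i \le M-N; \qquad (1)$$

hence $e_{iJ} \le e_{jJ}$ for $1 \le i \le j \le M-N$ and $\Pi(E_J) = E_J$.

(2) By Eq. (1) we also have $e_{iJ} \ge e_{i(J+1)}$ for $1 \le i \le M-N$; hence

$E_J \ge E_{J+1}$ for $1 \le J \le N-1$.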

(3) By (1) and (2), we have proved $\Pi(E_N) \le \Pi(E_J)$ for $1 \le J \le N$. For general values of J, the
proof is by contradiction. Suppose there exists J such that $\Pi(E_N) \le \Pi(E_J)$ is not true. This
means there exists i, $1 \le i \le M-N$, such that $\pi_i(E_N) > \pi_i(E_J)$ and

$$\pi_i(E_N) - 1 \ge \pi_i(E_J) \ge \cdots \ge \pi_1(E_J) \ge 1. \qquad (2)$$

Because $\pi_i(E_N) = e_{iN} = \left\lceil \frac{i}{N} \right\rceil$ and $\left\lceil \frac{i}{N} \right\rceil - 1 < \frac{i}{N}$ by definition, we have
$\frac{i}{N} + 1 > \pi_i(E_N)$ and

$$\frac{i}{\pi_i(E_N) - 1} > N. \qquad (3)$$

Hence, by Eqs. (2) and (3), there must exist m, $1 \le m \le \pi_i(E_N) - 1$, such that more than N
elements of $\{ \pi_1(E_J), \ldots, \pi_i(E_J) \}$ are equal to m. However, for a checking burst of length N, the
maximum number of non-checking cycles with the same EDL is N for any of the EDL values.
Therefore, we have reached a contradiction and $\Pi(E_N) \le \Pi(E_J)$ for all J. □

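THEOREM 3. The maximum EDL, $EDL_J^{max}$, is minimized by setting J = N and the
minimum value is $\left\lceil \frac{M-N}{N} \right\rceil$.

Proof. By (3) of Lemma 5, $EDL_J^{max} = \pi_{M-N}(E_J) \ge \pi_{M-N}(E_N) = EDL_N^{max}$ for all integer J.
Hence J = N minimizes the maximum EDL and

$$EDL_N^{max} = \max_{N \le j < M} \left\lceil \frac{M-j}{N} \right\rceil = \left\lceil \frac{M-N}{N} \right\rceil. \quad \Box$$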
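Theorem 3 can be illustrated by computing the transient-fault EDL directly from the task/PE model, in which a task advances J cycles in the checking pattern per PE. The sketch below uses hypothetical function names and assumes an array long enough to ignore end effects; it compares the maximum EDL for every offset J against the bound ceil((M-N)/N).

```python
def transient_edl(j, M, N, J):
    """Smallest i >= 0 with (j + i*J) mod M < N; None if the task is never checked."""
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return None


def max_edl(M, N, J):
    """Maximum EDL over the non-checking cycles N..M-1; None if some EDL is infinite."""
    worst = 0
    for j in range(N, M):
        e = transient_edl(j, M, N, J)
        if e is None:
            return None
        worst = max(worst, e)
    return worst


def theorem3_bound(M, N):
    """ceil((M - N) / N) computed with integer arithmetic."""
    return -(-(M - N) // N)


if __name__ == "__main__":
    M, N = 7, 2  # illustrative values only
    for J in range(1, M + 1):
        print(J, max_edl(M, N, J))
    print("bound:", theorem3_bound(M, N))
```

In this model no choice of J gives a smaller maximum than J = N, although other values of J may tie with it for particular M and N.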


3.5. Minimization of Error Escape Probability 

The price paid by PACED to maintain the desired throughput is lower error coverage. The
longer the error detection latency, the larger the possibility that an error will be masked during
the propagation and escape. Assume the probability that an error will be masked at any
computation cycle is $P_m$. The error escape probability, $P_{esc}(j)$, for a transient fault occurring at
computation cycle j is then given by $P_{esc}(j) = 1 - (1 - P_m)^{EDL(j)}$, $0 \le j \le M-1$.

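THEOREM 4. The average error escape probability is minimized by setting J = N.

Proof. Assume a transient fault occurs at each cycle j, $0 \le j \le M-1$, with equal probability.
The average error escape probability is

$$\frac{1}{M} \sum_{j=0}^{M-1} \left[ 1 - (1 - P_m)^{EDL_J(j)} \right] = \frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1 - P_m)^{EDL_J(j)} \right]$$

because $EDL(j) = 0$ for $0 \le j \le N-1$. By (3) of Lemma 5,

$$\frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1 - P_m)^{EDL_N(j)} \right] \le \frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1 - P_m)^{EDL_J(j)} \right] \quad \text{for all integer } J.$$

Hence, setting J = N minimizes the average error escape probability. □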

By Theorems 3 and 4, we conclude that for a given N/M, J should be set equal to the length
of the checking burst N, i.e., the pattern offset between adjacent PEs should be Om = N - 1, in
order to minimize the maximum EDL and maximize the error coverage. This optimality is
independent of the choice of N.
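The effect of the offset on error coverage can be illustrated numerically. The sketch below averages the escape probability 1 - (1 - Pm)^EDL(j) over all cycles for a given J, treating a never-checked cycle as a certain escape. The function names and the sample values (M = 7, N = 2, Pm = 0.1) are illustrative assumptions, and array-end effects are ignored.

```python
def transient_edl(j, M, N, J):
    """Smallest i >= 0 with (j + i*J) mod M < N; None if the task is never checked."""
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return None


def average_escape_probability(M, N, J, p_mask):
    """Mean of 1 - (1 - p_mask)**EDL(j) over cycles j = 0..M-1, equally likely."""
    total = 0.0
    for j in range(M):
        e = transient_edl(j, M, N, J)
        # An error that is never checked escapes with probability 1.
        total += 1.0 if e is None else 1.0 - (1.0 - p_mask) ** e
    return total / M


if __name__ == "__main__":
    M, N, p_mask = 7, 2, 0.1
    for J in range(1, M + 1):
        print(J, round(average_escape_probability(M, N, J, p_mask), 4))
```

Choosing J = N, i.e. Om = N - 1, yields the smallest average in this model, consistent with Theorem 4.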


3.6. Summary of Optimization Results 

Given a fixed checking frequency N/M, we choose M and N to be relatively prime in order
to minimize both M and N. Minimizing M allows Theorem 2 to more accurately state the
condition for uniform distribution of checking capability. We will show in the next section that
minimizing N can minimize the hardware overhead for data buffering. Since J should be equal
to N for optimal error detection, we have gcd(M, J) = gcd(M, N) = 1, which satisfies both
conditions in Theorems 1 and 2. Therefore, the possibility of infinite error detection latency is
eliminated and the checking capability is uniformly distributed among all tasks.


4. DESIGN CHANGES 


4.1. Synchronous Buffering Design 

By scheduling checking patterns among PEs, resource (PE) conflict may occur when a PE
is still checking old data but new data has been produced by its upstream PEs. It was shown in
Fig. 4(b) that the maximum wait time is equal to Om×k clock cycles. Hence, it is adequate to
insert Om×k buffers between each adjacent PE pair, driven by the original clock.^2 Since Om
should be equal to N-1 for optimal scheduling, the number of buffers will decrease as N
decreases. Therefore, choosing M and N to be relatively prime also minimizes the hardware
overhead for data buffering.

However as also shown in Section 3.1, the wait time, which determines the number of 
needed buffers, is not a constant but a function of computation cycle number. It becomes clear at 
this point that some kind of dynamic buffering technique has to be used to make the pipeline 
flow smoothly and correctly. We propose two such approaches, namely, data forwarding and 
dynamic reconfiguration. 

The data forwarding approach to buffering is described as follows. When a PE is ready to
output the processed data, the wait time logic shown in Fig. 5, which monitors the checking bit
sequence, has determined the wait time, Wait(j), of the current cycle and connected the PE output to
the buffer which is Wait(j) stages away from the downstream PE. Once the data is placed into
the appropriate buffer, the synchronous buffering design will ensure the data arrives at the
downstream PE at the correct clock cycle.

^2 It can be shown that the minimum number of required buffers is equal to
$\left\lceil \frac{O_m \times k}{k+1} \right\rceil$. However, more complicated control circuits are needed to reuse the buffers.

Figure 5. Buffering by data forwarding


As an alternative, since the number of buffers needed varies with time, we can treat the
extra buffers at each clock cycle as being "faulty" and use the Diogenes approach [16] to
dynamically reconfigure the "buffer arrays" by bypassing the "faulty" ones. A shifter clocked by the
falling edge of the clock (assume the PEs are clocked by the rising edge) is used to set up the
proper configuration of the buffers for the next data movement. The basic rule is:

(1) Include one more buffer if the downstream PE is in a computation cycle with checking
while the upstream one is not;

(2) Bypass one more buffer if the upstream PE is in a computation cycle with checking while
the downstream one is not;

(3) Maintain the current configuration (by disabling the clock input of the shifter) if the two
PEs are both checking or non-checking.

Figure 6. Dynamic reconfiguration circuit for buffering using Diogenes approach
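The three rules amount to a small state update on the number of buffers currently switched into the path. The sketch below models them in software under several assumptions that are not taken from the paper's circuit: the rule is evaluated once per clock cycle, k = 1 (so a checked computation cycle occupies two clock cycles), the downstream PE enters its checking burst Om clock cycles before the upstream PE, and M = 10 is a hypothetical period. All function names are illustrative.

```python
def next_buffer_count(current, upstream_checking, downstream_checking):
    """Apply rules (1)-(3) to the number of buffers in the path."""
    if downstream_checking and not upstream_checking:
        return current + 1  # rule (1): include one more buffer
    if upstream_checking and not downstream_checking:
        return current - 1  # rule (2): bypass one more buffer
    return current          # rule (3): keep the configuration


def checking_bit(t, start, burst_clocks, period_clocks):
    """Assumed clock-level checking bit: on for burst_clocks out of every period."""
    return (t - start) % period_clocks < burst_clocks


def simulate(M, N, Om, periods=2):
    """Buffer count per clock cycle for k = 1 (assumed timing model)."""
    period = M + N  # M computation cycles plus N extra checking clocks
    burst = 2 * N   # each checked computation cycle takes 2 clocks when k = 1
    count, history = 0, []
    for t in range(periods * period):
        down = checking_bit(t, 0, burst, period)
        up = checking_bit(t, Om, burst, period)
        count = next_buffer_count(count, up, down)
        history.append(count)
    return history


if __name__ == "__main__":
    print(simulate(M=10, N=4, Om=3))
```

Under these assumptions the count ramps up to Om, holds while both PEs are checking, and ramps back down to zero, mirroring the shape of the wait-time curve in Fig. 4(b).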

Because of the regular pattern of wait time variation (Fig. 4(b)), the reconfiguration circuit
is very simple, as shown in Fig. 6. An example showing the correct buffering at each clock cycle 
by using the reconfiguration approach is given in Fig. 7. 

4.2. Diagnosis 

Because permanent and transient faults occurring at different computation cycles will result 
in different combinations of computation and input error indications after various lengths of
latency, it is important for diagnosis to analyze all possible error patterns and classify the faults
into several categories according to their resultant error patterns. 

Since the error escape probability is related to EDL in terms of computation cycles, and in 
order to make the diagnosis procedure independent of the checking overhead k, we will use the 
task/PE diagram (Fig. 3) and the error indications from CCCs for error pattern analysis. How- 
ever, for a PACED array, the CCCs do not happen at the same clock cycle. It would be unac- 
ceptable if we have to wait for all the CCCs to finish before the analysis because that will delay 
the diagnosis and rollback by a considerable number of clock cycles. The following theorem gives
the upper bound for the number of error indications by any set of CCCs for $2 \le J \le N$. (The case
when J = 1 will result in peculiar error patterns which cannot share the same diagnosis and
rollback procedures with other choices of J. Since J = 1 also results in large EDL, it will be excluded
from future discussion.)

Figure 7. Example of dynamic reconfiguration for buffering. The parameters are N = 4,
Om = 3 and k = 1. When Q is 1, the corresponding "faulty" buffer is bypassed.

THEOREM 5. For $2 \le J \le N$, at most two PEs will have the earliest error indications
among all the CCCs for a single fault, and they must be adjacent.

Proof. Since a transient fault can only create one erroneous task, only one PE will detect it.
For a permanent fault occurring at cycle j, $N \le j < M$, the faulty PE itself will detect a
computation error after M-j cycles, and the erroneous task produced at cycle j+q, $0 \le q < M-j$, will be
detected $\left\lceil \frac{M-(j+q)}{J} \right\rceil + q$ cycles after the fault occurrence as an input error by the
$\left\lceil \frac{M-(j+q)}{J} \right\rceil$-th downstream PE from the faulty PE. For $J \ge 2$ and $q \ge 2$, we have

$$
\left\lceil \frac{M-(j+q)}{J} \right\rceil + q = \left\lceil \frac{M-j+(J-1)q}{J} \right\rceil \ge \left\lceil \frac{M-j}{J} + \frac{2(J-1)}{J} \right\rceil \ge \left\lceil \frac{M-j}{J} \right\rceil + 1,
$$

which is larger than $EDL(j) = \left\lceil \frac{M-j}{J} \right\rceil$. Hence, the only erroneous tasks which will possibly
be detected as the earliest input error indications are the ones produced at cycles j and j+1.

If M-j = 1, the two earliest error indications with EDL(M-1) = 1 are the computation error
detected by the faulty PE and the input error detected by the immediate downstream PE. If
$M-j \ge 2$, then $(M-j) > \left\lceil \frac{M-j}{J} \right\rceil$ for $J \ge 2$, and the two possible earliest error indications are both
input errors detected by two adjacent PEs executing at CCCs
($\left\lceil \frac{M-j}{J} \right\rceil - \left\lceil \frac{M-(j+1)}{J} \right\rceil = 0$ or 1 for $J \ge 2$). □


Therefore, in order to design a diagnosis procedure, we will always keep the pipeline
flowing until the immediate downstream PE finishes the CCC once the first error indication is raised
by some PE (called the detective PE). The number of clock cycles that the detective PE has to
wait is equal to the wait time of the current cycle because when the downstream PE is ready to
process the data corresponding to the task resulting in the first error indication, it must have
finished processing the previous task at the CCC and set up the error flags.

The next step is to classify all the faults according to their resultant error patterns. The
notation is defined as: Class a.b, where a = 1 means transient, a = 2 means permanent and b is the
further classification within each category. The corresponding error patterns are shown in Fig. 8.
"P" stands for permanent fault, "T" for transient fault and "F" can be either "P" or "T". "I"
indicates an input error and "C" represents a computation error.

(1) Class 1.1 and 2.1: For both permanent and transient faults, Fig. 8(a) represents the case
where a computation error indication occurs at computation cycle j, $1 \le j \le N-1$. The fault
must have just occurred at the detective PE; otherwise, it should have been detected at an
earlier cycle with checking. In order to distinguish between permanent and transient faults, the
faulty PE is given a second chance to do the recomputation. If the recomputation still sets
an error flag, the fault is permanent under the assumption that a transient fault never affects
more than one computation; otherwise, it is transient.

A more complicated situation occurs when the computation error flag is raised at cycle 0,
because it is possible that the fault is a permanent one which occurred during previous
non-checking cycles. However, the proof of Lemma 4 shows that the input error will be
detected no later than the computation error for such faults, and the two kinds of error will
be in the CCCs if and only if such faults occurred at cycle M-1. Therefore, if there is no
input error indication in the immediate downstream PE, as in Fig. 8(b), the fault must have
just occurred and the faulty PE is given a second chance.

(2) Class 1.2: A transient fault occurring at cycle j with $N \le j \le M-1$ will be detected as an
input error by the downstream PE which is EDL(j) stages away from the error source PE (Fig.
8(c)).

(3) Class 2.2: A computation error detected at cycle 0 and an input error detected by the
immediate downstream PE at the CCC indicate that the fault is permanent and occurred at cycle
(M-1) in the upstream PE (Fig. 8(d)).

(4) Class 2.3: A permanent fault at cycle j with EDL(j) = EDL(j+1), $N \le j \le M-2$, will be
detected by a single downstream PE as an input error (Fig. 8(e)).

(5) Class 2.4: A permanent fault at cycle j with EDL(j) = EDL(j+1) + 1, $N \le j \le M-2$, will be
detected by two downstream PEs as input errors (Fig. 8(f)).

Figure 8. Error pattern analysis. (The thick line passes through all CCCs which are
related to the detection of the present fault.)
Among the classes of faults defined in the previous paragraph, Class 1.1, 2.1 and 2.2 are
successfully identified by the error pattern analysis, and Class 1.2, 2.3 and 2.4 need further
diagnosis. The basic idea is to design a recovery cache for each PE for storing recent input data.
When an input error is detected by some PE, each upstream PE suspected of producing the error
reads from its recovery cache the input corresponding to the erroneous task and uses it as test
input to perform recomputation with checking. The computation and input error flags resulting
from these recomputations are used as syndromes and will uniquely identify the faulty PE and
cycle under the assumption of a single fault. Therefore, the diagnosis takes only one computation
cycle with checking. Again, for the regularity and simplicity of the diagnosis, the following rules
are adopted:

(1) Although it is possible to calculate the exact number of suspects which is less than or equal 
to the maximum EDL, for each input error detected we will always use maximum EDL as 
the number of suspects. Because the diagnosis procedure for each PE is done in parallel, 
this does not increase the time overhead and allows regular hardware connection. 

(2) For the faults in Class 2.4, we will ignore the first input error and only use the second error
indication for further diagnosis because the second one corresponds to the erroneous task 
produced earlier. 

The success of the above simple diagnosis procedure depends on the capability of each PE 
to retrieve the correct data from the recovery cache. Because the CCCs are skewed in a PACED 
array, it is very difficult to determine in which location of the cache the required data resides. 
The design of the direct-mapped recovery cache is aimed at simplifying the searching procedure. 
Similar to the direct-mapped cache design in the memory hierarchy for general-purpose
computers, where each position of the cache can only hold data from certain addresses with identical
least significant bits, each position i of our direct-mapped recovery cache can only hold data for
those tasks with id number n such that n mod (cache size) = i. Consequently, as long as we have
a recovery cache of sufficient size, i.e. larger than the maximum EDL, so that each data will not 
have been overwritten by the data from later tasks when it is needed for diagnosis, every 
suspected PE only has to read the test input from the same position as that in the detective PE 
and the hardware connection is simplified. 
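
The mapping rule can be illustrated with a minimal sketch. The class and method names are hypothetical, and storing the task id next to the data is an assumption made here only so that an overwritten entry can be recognized in the example; the paper does not say that a tag is kept.

```python
class RecoveryCache:
    """Direct-mapped recovery cache: task n lives in position n mod size."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # each slot holds (task_id, data) or None

    def store(self, task_id, data):
        # A later task with the same residue overwrites the older entry.
        self.slots[task_id % self.size] = (task_id, data)

    def lookup(self, task_id):
        # Every suspected PE reads the same position for the same task id.
        entry = self.slots[task_id % self.size]
        if entry is None or entry[0] != task_id:
            return None  # never stored, or already overwritten
        return entry[1]
```

With a cache of size 4, the input for task 5 overwrites the input for task 1, which is why the cache must be larger than the maximum EDL.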

Start-up control is a mechanism to set up the cache correctly once and for all when the
pipeline starts flowing, so that whenever new data has to be placed into the cache, it is put into the
next position or, when reaching the end, the first position. Fig. 9 shows how the start-up control
works. The start-up delay for each PE is computed by accumulating the wait times. Each PE can
only start reading in the data after the start-up delay. Therefore, the first data each PE places in
the recovery cache will be for task 0 and later data can simply follow on top. The start-up control
is also utilized for rollback, which is discussed next.


[Figure 9: timing diagram showing the task numbers handled by PE1 to PE4 against clock cycles; SD1 to SD4 mark the start-up delays.]

Figure 9. Start-up control (SD: start-up delay) 
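
A minimal sketch of start-up control follows. It assumes that "accumulating the wait times" means a running sum along the pipeline; `wait_times`, `start_up_delays` and `CacheWritePointer` are hypothetical names, and the wait-time values in any example are made up for illustration.

```python
from itertools import accumulate


def start_up_delays(wait_times):
    """Running sum of per-PE wait times gives each PE's start-up delay.

    wait_times[k] is assumed to be the extra wait introduced at PE k.
    """
    return list(accumulate(wait_times))


class CacheWritePointer:
    """Sequential write position that wraps around at the end of the cache."""

    def __init__(self, size):
        self.size = size
        self.next_position = 0  # the first datum stored is for task 0

    def advance(self):
        position = self.next_position
        self.next_position = (self.next_position + 1) % self.size
        return position
```

Because each PE stores task 0 first, stepping the pointer once per incoming task reproduces the n mod (cache size) placement without computing the task id explicitly.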
4.3. Rollback

Once the faulty PE and faulty cycle are identified, if the fault is permanent, a spare PE is 
brought in to replace the faulty one, and then rollback recovery starts after the reconfiguration; if 
the fault is transient, rollback directly follows diagnosis, or actually, overlaps with it because the 
recomputation in diagnosis can be used as the first step in rollback. 

The rollback procedure can be divided into two steps: flushing and local recovery. Similar
to the simplification in diagnosis, since the rollback is done in parallel for each PE, we will use
the same procedure for the recovery of both permanent and transient faults even if the latter
results in fewer erroneous data.

(1) Flushing: First, we define the erroneous block to consist of all the following data: (1) data
inside the PEs and buffers between error source PE and detective PE; (2) data inside the
recovery cache between these two PEs, from the position containing the data corresponding
to the erroneous task up to the most recent position. The region enclosed by dashed lines in
Fig. 10(a) shows the erroneous block for the case where a transient fault occurred in PE1
when task number 7 was being processed and is detected by PE4 as an input error. The first
step of rollback is to flush all the data in the erroneous block as shown in Fig. 10(b).

(2) Local recovery: Since the fault only affects the PEs between the error source PE and the
detective PE inclusive (called the local recovery set), the portion of the pipeline containing
all the other PEs is frozen during the rollback. The local recovery line is defined to consist
of all the PEs in the local recovery set with the erroneous task number. The local recovery
scheme is to apply the start-up control to the local recovery line by viewing the local
recovery set as a short pipeline, the erroneous task as the first task and the data in the
recovery cache of the error source PE, which is correct and thus not flushed, as the input
data. The erroneous block is rebuilt as shown in Fig. 10(c)-(h), after which all PEs proceed
as before (Fig. 10(i)).

Figure 10. Local recovery procedure (a) fault occurrence and error detection; (b) data
flushing; (c)-(h) local recovery using start-up control; (i) resumption of normal
processing.
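
The bookkeeping behind the two rollback steps can be sketched as follows. All function names are hypothetical. The sketch assumes that PEs are numbered along the pipeline, and that the caches to be flushed are those of every PE in the local recovery set except the error source PE, whose cached input is correct and kept.

```python
def local_recovery_set(source_pe, detective_pe):
    """PEs from the error source PE to the detective PE, inclusive."""
    return list(range(source_pe, detective_pe + 1))


def caches_to_flush(source_pe, detective_pe):
    """PEs whose recovery caches hold erroneous data (assumed: all but the source)."""
    return list(range(source_pe + 1, detective_pe + 1))


def cache_positions_to_flush(erroneous_task, latest_task, cache_size):
    """Positions from the erroneous task up to the most recent task in one cache."""
    return [t % cache_size for t in range(erroneous_task, latest_task + 1)]
```

For the case in Fig. 10(a), a fault in PE1 detected by PE4 gives the local recovery set [1, 2, 3, 4]. The value of `latest_task` differs from PE to PE because the pipeline is skewed.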


4.4. Summary of Design Changes 

A PACED array equipped with the proposed design changes was shown in Fig. 1. Error
checking circuits are built into each PE for time-redundant computation checking and input code
checking. A code checker is added at the end of the array as discussed in Section 3.2. With
optimal scheduling of checking patterns and 100% overhead time-redundant checking (k=1), N-1
buffers are inserted between each adjacent PE pair and a recovery cache of size 2 × (M-N)/N is
attached to each PE for storing incoming data from the top and the left. The control logic is
responsible for correct data buffering, start-up control, error diagnosis and local recovery. The
techniques described in this paper have been simulated on an Alliant multiprocessor with eight
processors to verify their correct operation.


5. CONCLUSIONS 

It was shown that, for a PACED array with the period of checking pattern equal to M
computation cycles, the length of checking burst equal to N computation cycles and fixed throughput
(determined by the checking frequency N/M), the optimal scheduling in terms of minimizing the
maximum error detection latency and error escape probability is achieved by setting the
checking pattern offset Om to N - 1. Also, by choosing M and N to be relatively prime, the hardware
overhead for data buffering is minimized and the checking capability is uniformly distributed
among the tasks.

Dynamic buffering techniques to preserve the systolic nature and the implementation for
rollback recovery under faults were presented. It was shown that the complexity in the diagnosis
and recovery process, resulting from the error latency as a trade-off for performance, can be
reduced through the use of a direct-mapped recovery cache and start-up control. The design
flexibility can be further improved by using a programmable control unit.
APPENDIX

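Proof of Lemma 2.

$$
\begin{aligned}
(j + i \times J) \bmod M &= \bigl(j + (i \times J) \bmod M\bigr) \bmod M \\
&= \bigl(j + ((i \bmod M) \times J) \bmod M\bigr) \bmod M \\
&\in \{\, (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \,\} \quad \text{by Lemma 1.}
\end{aligned}
$$

Also, $\{\, (j + i \times J) \bmod M \mid 0 \le i \le M-1 \,\} = \{\, (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \,\}$, and
$(j + i \times J) \bmod M$, $0 \le i \le M-1$, contain $d$ copies of each element in the set on the right-hand side.

For $0 \le n \le M/d - \lfloor j/d \rfloor - 1$,

$$
\begin{aligned}
&\Rightarrow\; 0 \le j + nd \le M + j \bmod d - d < M \\
&\Rightarrow\; (j + nd) \bmod M = j + nd = j \bmod d + (n + \lfloor j/d \rfloor)d = j \bmod d + md, \quad \lfloor j/d \rfloor \le m \le M/d - 1.
\end{aligned}
$$

For $M/d - \lfloor j/d \rfloor \le n \le M/d - 1$,

$$
\begin{aligned}
&\Rightarrow\; M + j \bmod d \le j + nd \le M + j - d < 2M \\
&\Rightarrow\; (j + nd) \bmod M = j + nd - M = j \bmod d + (n + \lfloor j/d \rfloor - M/d)d = j \bmod d + md, \quad 0 \le m \le \lfloor j/d \rfloor - 1.
\end{aligned}
$$

Therefore,

$$
\{\, (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \,\} = \{\, j \bmod d + md \mid 0 \le m \le M/d - 1 \,\}.
$$

Finally, we have $(j + i \times J) \bmod M \in R_{j \bmod d}$ and $\{\, (j + i \times J) \bmod M \mid 0 \le i \le M-1 \,\} = R_{j \bmod d}$. $\square$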
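
The statement of the lemma can be checked numerically with a short sketch. It assumes that d denotes gcd(J, M), as the use of Lemma 1 suggests, and that R_r is the set {r + md | 0 ≤ m ≤ M/d - 1}; both readings are assumptions, since the definitions appear earlier in the paper. The function name is hypothetical.

```python
from collections import Counter
from math import gcd


def check_lemma2(j, J, M):
    """Check that {(j + i*J) mod M} equals R_(j mod d), each value hit d times."""
    d = gcd(J, M)  # assumption: d is the greatest common divisor of J and M
    visited = Counter((j + i * J) % M for i in range(M))
    residue_class = {j % d + m * d for m in range(M // d)}
    same_set = set(visited) == residue_class
    d_copies = all(count == d for count in visited.values())
    return same_set and d_copies
```

For example, with M = 12, J = 8 and j = 5 the visited values are 1, 5 and 9, each occurring four times.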